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Abstract 



u 
q 

We propose a proximal point algorithm to solve LAROS problem, that is the problem of finding 
a "large approximately rank-one submatrix" . This LAROS problem is used to sequentially extract 
features in data. We also develop a new stopping criterion for the proximal point algorithm, which 
is based on the duality conditions of e-optimal solutions of the LAROS problem, with a theoretical 
\^ • guarantee. We test our algorithm with two image databases and show that we can use the LAROS 



problem to extract appropriate common features from these images. 
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1 Introduction 

Feature extraction is an important application in information retrieval. For example, let us consider a 
• r- 1 . 

^ ■ matrix A £ IR™ xn that represents a database of pixelated and registered grayscale images which have 

h : 

the same size. Each column of A corresponds to one image and each row corresponds to a particular 
pixel position in those images. The value Aij is then the intensity of the ith. pixel in the jth image. A 
common visual feature represented by the pixels in J C {1, . . . , n}, which occur in a subset of images 
in X C {1, . . . , m}, can be associated with the approximately rank-one submatrix A(X, J) of the matrix 
A. We assume here the features are non-overlapping. If we want to more than one visual feature, 
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we can iteratively find an approximately rank-one submatrix, subtract it from A (perhaps modifying 
the result of the subtraction to ensure that A remains nonnegative) , and then repeat the procedure. 
Doan and Vavasis [3] proposed the LAROS problem which tries to find "large approximately rank-one 
submatrix". The proposed convex parametric formulation for the LAROS problem is written as follows: 

min ||-X"||e := ||-X"||* + $||^||i 
s.t. (A,X) = 1, 

where 9 > 0. Here ||^||* denotes the nuclear norm of X, which is defined to be the sum of the 
singular values of X, and ||-X"||i denotes the sum the absolute values of all the entries of X. Theoretical 
properties of LAROS problem have been developed in [3]. In this paper, we investigate algorithms to 
solve the problem and apply it to find features in data. We will focus on proximal point algorithmic 
framework, which have recently been studied for nuclear norm minimization (see Liu et al. [I] and 
references therein). 

Throughout the paper, we use || • || to denote either the Frobenius norm of a matrix or the Euclidean 
norm of a vector. The spectral norm of a matrix X is denoted by ||-X"||2- 

Proximal Point Algorithm 

The proximal point algorithm is based on the Moreau- Yoshida regularization of the (non-differentiable) 
convex optimization problem 

min0(£c), (2) 

where X is a finite-dimensional real Hilbert space and <j) '■ X — > (— oo, oo] is a proper, lower semicontin- 
uous, convex function. For an arbitrary A > 0, the regularization is defined as 

$\(x) = mm ( (f)(z) + —\\x - z\\ 2 ) , Vx G X. 
zex y 2A J 

The above optimization problem has a unique optimal solution p\(x) for all x G X, and p\ is called 
the proximal point mapping associated with eft. One of the most important properties of &\ and p\ is 
that the set of optimal solutions of ([2]) is exactly the set of optimal solutions of the following optimization 
problem: 

mm&\(x), (3) 

where &\ is now a continuously differentiable convex function defined on X with a globally Lipschitz 
continuous gradient V&\ (with modulus 1/A). The necessary and sufficient optimality condition of ([3]) 
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can then be expressed as follows: 

V$ x (x) = ^px(x) = x, (4) 

where p\ is a global Lipschitz continuous function with modulus 1. 

The proximal point algorithm is an iterative method to solve the problem (|2|) that uses the optimality 
condition written in In each iteration, x k+1 ~ p\ (x k ) according to a sequence {A&} of regularization 
parameters. The convergence of the algorithm has been studied by Rockafellar [6] in a more general 
setting of inclusion problems with maximal monotone operators. Note that the problem ([2]) is equivalent 
to the inclusion problem G dcp(x), where d<f> is a maximal monotone operator if (f> is a proper, lower 
semicontinuous, and convex function. We now ready to study the proximal point mapping for our 
particular problem. In order to apply the framework, we reformulate Problem (pQ) with a redundant 
variable as follows: 

min ||-X"i||* + 6*||X 2 ||i 

s.t. {A,X 1 ) = 1, (5) 
X\ = X 2 . 

In addition, to introduce more flexibility into our model, we study Problem (|5|) under the following 
more general setting: 

min + 0||X 2 ||i 

(6) 

s.t. A(X) -b€Q, X = (Xi,X 2 ) G K mxn x R mxn 

where b £ H, A : W nxn x M mxn — >. % is a given linear map, and Q is a pointed close convex cone in %. 
Here H. is a finite-dimensional Hilbert space. For the problem ©, we have H. = MxM mxn , Q = {0} x {0}, 
b = (1,0), and A{X) = {{A,Xi),Xi - X 2 ). Note that the adjoint A* : U -> W mxn x W mxn is given 
by A*z = (ziA + Z 2 , -Z 2 ) for any z = (zi, Z 2 ) G 

2 Primal Proximal Point Algorithm 

We define the function 6 as follows: 



(*) = 



(7) 



||-X"i||* + 0||X2||i, X — {X\, X 2 ) G J 7 , 
+00, otherwise, 

where J 7 is the feasible set of the problem ([6]) . The problem Q is then equivalent to the optimization 
problem 

min d>(X), 



where X - 

We now introduce dual decision variables z € W and define the Lagrangian function L(X, z), 

+0||X 2 ||i + (z,b-A(X)) if z G Q* 

(8) 

-oo otherwise 

where Q* is the dual cone of Q defined by Q* = {y € H : (y, z) > 0, Vz G Q}. For our problem, Q* 
is simply the whole space, Q* = % = R x K mXn . Clearly, </>(X) = sup L(X,z). We now calculate the 
Moreau-Yoshida regularization of <^>: 

d> A (X) = min ^(V) + - V|| 2 ) . (9) 

Applying the strong duality (or minimax theory) result in Rockafellar [5], we have: 

$x(X)= min sup (l(Y,z) + \\\X - V\\ 2 
Vex zeH V 2A 

= sup min f L(V,z) + - Fll 2 

= sup min (\\V 1 \U+e\\V 2 \\ 1 + (z,b-A(V)) + ^-\\X-V\\ 2 

= sup (z,6) + 7^r||X|| 2 - ^-\\X + A„4*z|| 2 + min (||Vi||* + 0||V 2 ||i + -U|V - (X + A.4*z)|| 2 
zgQ* zA zA VzlX y zA 

Now, consider the first inner minimization problem, we have: 

min (j| VxlU + 0|| V2H! + V - (X + XA**)II 2 ) (10) 

= min f ||Vi||„ + ^||Vi - (Xi + ABi*)IN + 0mm f ||V 2 ||i + ^||V 2 - (X 2 + XB 2 z)\\ 2 

where we have written A*z = (Biz, B 2 z) S X. The first optimization problem on the right-hand side is 
the Moreau-Yoshida regularization of the nuclear norm function at Xi + XB±z, and the problem has an 
analytical solution given by 

+ XB x z) = £/Diag(max{^ - X,0})V T , (11) 

which is computable from the singular value decomposition, X\ + XB\z = UHV T . In addition, the 
minimal objective value is given by 

l.\\X 1 + XB lZ \\ 2 - ^-\\p^(X 1 + XB lZ )f. 

Next, we consider the second inner minimization problem on the right-hand side of (|10p. This opti- 
mization problem is the Moreau-Yoshida regularization of the Zi-norm function (with parameter X9) at 



X 2 + XB 2 z, and it has the following analytical solution: 

pf}{X 2 + XB 2 z) = sgn(X 2 + XB 2 z) o max{|X 2 + XB 2 z\ - 9X,0}, (12) 

where o is the Hadamard product (or entrywise product) and sgn is the (entrywise) sign function. The 
corresponding minimal objective value is given by 



^\\X 2 + XB 2 z\\ 2 - 2^||p$(X 2 + XB 2 z)\\ 2 . 



Combining these two results, we can compute <fr\(X) as follows: 

1 """2 . _ ( , ,\ 1 ||„(1W , xe_M|2 1 ,, . x „s,2 



$ A (X) = — + sup Uz, b) - ^\\P\'(X 1 + XB lZ )f - 2X||P^(X 2 + ABa^ir J . (13) 
where p^ and p A ^ are defined in (fTTj) and (fT2j) respectively. Now define 

6 A (X,z) = (z,6) - ±-^\ Xl + XBlz) f _ -L||pS5(X 2 + AS 2 z)|| 2 (14) 



and consider 

z\(X) G arg sup 8a(X,z). 

Applying the saddle point theorem in Rockafellar [5], we obtain the proximal point mapping associated 
with (j> as follows: 

Px (X) = (p A 1} (X a + XB lZx (X)), pg(X 2 + AS 2 z A (X))) . (15) 
The primal proximal point algorithm (primal PPA) has the following template. 



The Primal PPA. Given X° G Af, A > and e > 0, perform 


the following loop: 




Step 1. Find an (approximate) optimal solution 






z k G arg sup 6 Afc (X fc ,z), 

zeQ* 




(16) 


where G Afc is defined in (fl4~|). 






Step 2. Update 








(A3 + A£ 2 z fc ) 


(17) 


according to the proximal point mapping in (jl5[). 






Step 3. If - X k |/Afe < e, stop; else, update A fc , end 
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3 Dual Proximal Point Algorithm 

The dual problem associated with ([6|) is given as follows: 



maxff(y) (18) 
yen 



where g is the concave function defined by 



inf{||Xi||* + B\\X 2 \\i + {y, b -A(X)) : X = (X 1 ,X 2 )gX} if y G Q* 
g(y) = { (19) 
— oo otherwise 

The Moreau-Yoshida regularization of g is given by 
G x {y) := max{g(z) - -^-\\z - y\\ 2 } 

zGrt LA 

= max inf {||X 1 ||,+0||X 2 || 1 + (z,6-^(X))-i-||z-y|| 2 } 

= inf max{||X 1 ||, + ^||X 2 || 1 + (z,6-^(X))-i-||z-2/|| 2 } 
X£Xz&Q* 2A 

= ~||i/f + ^{11X111, + e||X 2 ||i + * A (X;i/)} (20) 

where 

* A (X; y) = ^||n 2 *(y + A(6 - A(X)))\\ 2 . (21) 

Note that Vx^\(X;y) = —A*H.Q*{y + X(b - A(X))). The dual algorithm can then be written as 
follows. 



The Dual PPA. Given a tolerance e > 0. Input y° G Q* and A > 0. 


Set k := 0. Iterate: 




Step 1. Find an approximate minimizer 






X fc «arg inf + 0||X 2 ||i + *A fc (X; 




(22) 


where \l/ Afc (X;?/ ) is defined as in (|2TT). 






Step 2. Compute 

/+ 1 = n Q ,[/ + A fc (6-.4(X fc ))]. 




(23) 


Step 3. If \\{y k — y k+1 )/\k\\ < e; stop; else; update A& ; end. 
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4 Implementation Issues 

4.1 Primal proximal point algorithm 

For the primal PPA, the most important issue we have to address first is how to solve the inner 
problem sup @\(X,z). We have that Q\ is a concave function in z due to the linearity in z of the 

z£Q* 

Lagrangian function L. From the general gradient formulation X7&\(x) = — (x — p\(x)), we have that 

A 

V||p^(Xj)|| 2 = p^(Xi), i = l,2. Thus ®\ is continuously differentiable with 

V z O x (X,z) = b-B{p { t\x 1 +\B 1 z)-Blpf}{X 2 + \B 2 z). (24) 

Note that for the problem ©, we have B x z = Az x + Z 2 , B 2 z = -Z 2 for z = (z\, Z 2 ) £ R x R mxn . 
Correspondingly, we have = ({A,Xi),Xi) and B* 2 {X 2 ) = (0, -X 2 ) for any X X ,X 2 £ M mxn 

and 

V,6 A (X,z) = (l - (A,p^(X 1 + A(Az! + Z 2 ))), p { 3(X 2 - XZ 2 ) -p ( i\x 1 + \{Az x + Z 2 ))). (25) 

In addition, using the global Lipschitz continuity (with modulus 1) of two proximal point mappings, p\^ 
and p*xe , we can show that the gradient V Z Q\ is globally Lipschitz continuous with modulus A(|| A|||+2). 

With all these properties of 7 , we can solve the inner problem using first-order gradient-based 
methods such as steepest descent method. 

The second issue is that these inner problems are typically only solved approximately which results 
in inexact proximal point mappings. For inexact proximal point method, Rockafellar [B] provides two 
convergence criteria for global and local convergence. Based on the aforementioned convergence criteria, 
Liu et al. [1] have proposed some checkable stopping criteria for the inner problems to maintain global 
(and local) convergence of the proposed (inexact) proximal point method (for nuclear norm minimization 
problems). We can extend these stopping criteria for our problem. 

The third issues is to calculate a partial singular value decomposition in order to compute the 
proximal point mapping of the nuclear norm function (the computation of the proximal point mapping 
of the Zi-norm function is straightforward). As in Liu et al. [3], we use a Lanczos bidiagonalization 
algorithm with partial reorthogonalization to compute a partial singular value decomposition. We also 
need heuristics to set the number of singular values required to be computed with this algorithm. 
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4.2 Dual proximal point algorithm 

For the dual PPA, we need to look for a method to solve the inner problem inf < 1 1 JSC ill* + ^N-XTo 111 + 

X£X I" 

fy\ k (X; y fc )|. Similar to the approach proposed in Liu et al. 0], we will apply the accelerated prox- 
imal gradient algorithm pQ for this problem. According to Toh and Yun [8], we solve the problem 
mmP(X) + f(X), where P(X) = ||Xi||* + 9\\X 2 \\i and f(X) = V x (X;y). We have that the 
gradient V x 1 ^ x(X;y) = —A*Hq*(h + X(b — A(X))) is globally Lipschitz continuous with modulus 
L = A(||A||| + 2). The proximal gradient algorithm in each iteration needs to solve the following 
quadratic approximation of the sum P(X) + f(X) at the current solution Y: 

Q t (X; Y) = P{X) + /(F) + (V/(F, X-Y) + t -\\X- Y\\ 2 F 
= P(X) + l\\X- G t {Y)f F + f(Y) - I||V/0n|H, 

where Gt(Y) = Y — —\7f(Y). This function is a strongly convex function in X and hence it has a 
unique minimizer St(Y). We have that 

P(X) + l\\X- G t {Y)f F = \\X X \1 + e\\X 2 \\ x + t - G\{Y)\\ 2 F + \\X 2 - G 2 (Y)\\ 2 F ) 

1-XTilU + - G\{Y)\\l} + (9\\X 2 \\ X + i\\X 2 - Gf(y)|||) , 

where G t {Y) = {G\{Y),G 2 t {Y)). Thus the minimizer S t {Y) = {S}{Y), Sf{Y)), where S}(Y) is the 
minimizer of the problem min M|Xi||* H — \\Xi — Gl(Y)\\ F ) and S 2 (Y) is the minimizer of the op- 

timization problem min 0||X 2 ||i + -\X 2 - G 2 t (Y)\\ 2 F . Similar to the previous section, the analytical 
x.2 2 

solutions for these two optimization problems can be calculated and they are: 

Sl{Y)=pfl{G\(Y)), S 2 (Y)=p ( X(G 2 (Y)). (26) 

Finally, the proximal gradient algorithm for our problem can be described as follows. Given tq = t_i = 1 
and X° = X -1 , each iteration includes the following steps 

1. Calculate Y k = X k + Tfc ~ 1 ~ 1 ( X k - X k ~ l 



2. Update X k+1 = S t k(Y k ) according the formulas in flU). 



3. Update r k+1 = ^ [J 1 + At 2 + 1 

The update of in the third step is to make sure that r| , 1 — tj. + i < t| and t^+i > 1, a convergence 
condition of the proximal gradient algorithm. We also need to have the update rule for the remaining 



parameter t)~, which affects the quadratic approximation of the function / at Y. Since the gradient 
V/(V) is Lipschitz continuous with modulus L, for all t > L, we have: 



In order to have a better approximation, we would like to have smaller t and in the accelerated proximal 
gradient algorithm, we will use line search to find t/. < L such that the above condition is still satisfied, 
starting with t\ = L. More details can be found in Toh and Yun [8]. 

5 Sparse Structure of Rank-One Optimal Solutions 

The proposed proximal point algorithm is a first-order iterative method, which normally does not have 
fast convergence. Applying duality results obtained by Doan and Vavasis [3] for Problem ([!]), we would 
like to study better stopping criteria for the proposed proximal point algorithms. We focus on the case 
when Problem (JT]) has a rank-one optimal solution X = auv T since rank-one optimal solutions are what 
we are looking for in general. The purpose of the termination test is to obtain the correct supports of 
u and v, that is, the positions of their nonzero entries with a guarantee certificate when we only have 
approximate values for u and v from the proposed first-order algorithm. Although the technique in this 
section is developed for Problem ([I]), similar ideas can be applied to other proposed formulations with 
the nuclear norm such as the matrix completion problem. In particular, a test like this for the matrix 
completion problem can be used to rigorously establish the correct rank of the optimal solution from 
approximate solutions obtained from a first-order method. 

Now let us consider the rank-one optimal solution X of the following form 



where U\ > is a unit vector in R M , M < m, and V\ > is a unit vector in M. N , N < n. If U\ 
and v\ are determined, a\ can be easily calculated to satisfy the optimality condition ||-X"||e = 1 (we 
assume here i / 0). Note that in general, the rank-one optimal solution X could have a different 
block structure. However, without loss of generality, we can assume that u\v\ forms an upper left 
principal submatrix of X for ease of exposition. Under this assumption, we can set u = [«i;0] £ M. m 
and v = [vi;0] £ M. n with a = a±. Similar to Theorem 5 in Doan and Vavasis [3]j we can then write 
the optimality conditions as follows: 



P(S t (Y)) + f(S t (Y)) < Qt(S t (Y); Y). 
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There exists W G R mxn and V G R mxn such that 

A = \\A\\*(uv T + W) + e\\A\\*V (27) 

||W|| 2 <1, W T u = 0, Wv = 0, V u = E MxN , ||V||oo < 1, 

where EmxN is the M X N matrix of all ones. 

Letting A = 1/||A||2 and splitting all matrices into four subblocks according to the sparse structure 
of X, we obtain the following detailed optimality conditions: 

(l,t*i,«i) is a singular triple of AAn — OVn, and Wn = (AAn — 9Vn) — u\vj (28) 

Wis = AAi 2 - 9Vi2, W\ 2 u x = 0, and ||Vi 2 ||oo < 1 (29) 

W 2 i = AA 21 - 6V 2 i, W 2 iV! = 0, and ||V 2 l||oo < 1 (30) 

W 22 = AA 22 - 9V 22 , and ||V 22 |U < 1 (31) 

||W|| 2 < 1. (32) 

The following lemma shows how to find u±, v\ and A (or ||A||g) from the first optimality condition. 

Lemma 1. // (A, iti,t>i) satisfies A28\) , then x = {\,ui,V\) is a solution of the following system of 
nonlinear equations 



P(x) 



0. (33) 



( (AAn — QV\\)v\ — u\ N 
(XA 11 -9V u ) T u 1 -v 1 

y ujui - 1 j 

Proof. It is easily to see that vjvi = vJ(XAn — 9Vn) T ui = uju\ = 1 and the first two equations 
indicate that (l,iti,vi) is a singular triple of AAn — OVn. □ 
The system of equations in (|33|) has M + N + 1 variables and M + N + 1 equations, which can be 
solved using Newton method. One of the convergence results of the Newton's method is the Kantorovich 
theorem, which is given as follows (see Tapia [7]). 

Theorem 1 (Kantorovich). Assume that P is defined and is Frechet differentiable at each point in a 
given open convex set Dq and for some Xq G Dq that [P^xq)]" 1 exists and that 

(i) \\[p'( Xo )]-m<B, 

(ii) IHP'^o)]-^^)!! <T), and 
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(Hi) \\P'(x) — P'(y)\\ < K\\x — y\\, for all x and y in Dq, 
with h = BKrj < —. 



Let fi* = {x I [I sb — osq II — where t* = ( — ^ ri. Now if $7* C Dq, then the Newton 

\ h J 

iterates, aifc+i — x k ~ [P'( x k)] P( x k)> are well defined, remain in f2*, and converge to x* € $7* such 
that P(x*) = 0. In addition, 



i * ii ^ V 
\x - x k \\ < - 

h 



{\-^/T^2h) 2k 

2^ 



k = 0,1,2, 



According to Theorem [TJ if we can find xq with the corresponding parameter h < 1/2, then for an 
arbitrary e > 0, an e-solution x such that \\x — x*\\ < e, can be achieved after a finite number of Newton 
iterations. Now assuming that we have obtained an e-solution (A, Ui,vi) of the system of equations 
in (|33|) . we would like to characterize the sufficient conditions which guarantee that the corresponding 
solution (A*, it*, v^) defines the optimal solution X of (PQ) as described above. The following proposition 
shows these sufficient conditions. 

Proposition 1. Consider an e-solution (A, u±,Vi) of the system of equations in \33\) , < e < 1/2. The 
corresponding solution (A* ,u\,v*) defines the rank-one optimal solution X* , 



X* 



alu\{v\) T 0' 
0, 



of (CP if there exist W and V that satisfy the following conditions: 

(i) Wit = (AAn - 6Vu) - uxvj and V u = E MxN , 

(ii) Wi 2 = AA 12 - 9V 12 , Wj 2 ux = 0, and \\Vul\oo < 1 " ^(II^IU + 5)e, 
(Hi) W 2l = \A 21 - 9V 21 , Wzivi = 0, and ||V 2 i||oo < 1 - _1 (ll^2i||oo + 5)e, 

(iv) W 22 = XA 22 - 0V 22 , and ||V 2 2||oo < 1. and 

(v) \\W\\ 2 < 1- (||A|| 2 + 7.5)e. 

Remark 1. In order to test stopping conditions specified in Proposition [7J we need to start with an e- 
approximate of the optimal solution (A*, tt*, vj), where u\ and v\ are unit vectors. It is therefore better 
to solve the problem where A* = 1/||A||S has the same magnitude as entries of u\ and v*. Heuristically, 
we could scale A so that ||A|| 2 = 1 to (partially) control the magnitude of A* . 
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Proof. Suppose we are given W and V which satisfy the conditions (i)-(v). We will construct 
W* and V* from W and V and prove that they satisfy all optimality conditions in (|28p -( |32l) when 
combining with the solution (X* ,Ui,v*) of ()33|) . We start with the (1, 1) subblock . Clearly, we need 
V* n = V u = EmxN and W* n = (A* An - OV^) - u\{v\) T . We have: 

W* n - W n = (A* - A)An - KK) T - u lV J). 

Since (A, tti,vi) is an e-solution, we have that max{|AA|, || Atti||, || A«i||} < e, where AA = A — A*, 
Aui = ui — u\, and Ai>i = v\ — v\. Hence 

\\ u i( v i) T ~ = \\u\(v\) T - {u\ + /\ui)(v\ + Aui) T || 

= ||Aui(t>^) T + ^A^ + A«iA«f|| < 2e + e 2 , 

since \\u\\\ = \\v\\\ = 1. 

We continue with the (2, 2) subblock. Let V* 22 = V 22 and W 22 = X*A 22 - 0V 22 , we have: 

W* 22 - W 22 = (A* - A)A 22 . 

Now consider the (1,2) subblock, we would like to construct W\ 2 that is close to W\ 2 and satisfies 

the condition that (W\ 2 ) T u\ = 0. We will use appropriate Householder matrices to construct W\ 2 

as follows. For two different unit vectors x and y, the Householder matrix Q = I — 2zz T with 
x — 1 j 

z = ±- rr- transforms x to y and vice versa. In other words, Qx = y and Qy = x. The 

\\ x - y\\ 

Householder matrix Q is symmetric and orthonormal. Now consider U\ = Note that since 

\\u\ || = 1 and || A«i|| < e, we have that |||«i|| — 1| < e, which implies || Aiii|| < 2e, where Au± = u± — u*. 
u* -\- U\ 

We define x = — - — — ^ and consider two Householder matrices, Q 1 and Q 2 , which transform u\ 

ll^i ^1 II 

to x and x to Wj, respectively. Let us define 

u\ + ui „ u\ + Ui 

Wl = Ui + T _ , w 2 =u 1 + 

\\ u l + u lll \\ u l + u l\\ 

then Qi and Q 2 can be constructed with z\ = u?i/||u?i|| and z 2 = w 2 /\\w 2 \\, respectively. We have 



T 

■"ill/ V " ll w l+™l| 



w{w x = [ui + I _ Ui+ H , . I, = 2 + l|iti + mn, 



or ||wi|| = \/2 + \\u* + t*i||. Similarly, we can also show that 1 1 1 1 = || w l|| = ^/2 + \\ul + ui||. Thus 
we have 

1 

Azi = Z\ — z<? = — Au\ 
y/2 + [Jttf+fiil] 



12 



1 2 
Hence IIAzill < =11 Atxill < — F=e- 

Now consider Q l2 = Q 2 Q\, we have: Qu\ = u* and Q is also an orthonormal matrix. Define 

= Q12W12) we have: W* 2 satisfies the condition {W\ 2 ) T u\ = since W\ 2 u x = 0. We can then 

select V\ 2 = (A*Ai2 - W\ 2 )/9. Thus 

W* 12 - W 12 = (Q 12 - I)W 12 , V* 12 -V X2 = - e [(A* - \)A l2 - (Q 12 - I)W 12 ] . 

We have 

Q 12 - I = (J - 2Z2-Z2) ( 7 - 2zizf) - I = 2z 2 Az\ - 2Az x z[ + &(z 2 Az{)z 2 z\. 

Thus 

i||Q 12 - I|| 2 = 2||A^ 1 || 2 + 2(z 2 r Az 1 )(zf A Zl ). 
Since Z\ and z 2 are unit vectors, we have: 

HOia-III < 4||Azi||. 

The final (2, 1) subblock can be analyzed similarly. We would like to find W 21 close to W 2 \ such that 
W^i^i = 0- We can define v\, y 1 , and y 2 , and Q 2 i i n a similar way to u\, z%, and z 2 , and Q 12 . We 
then have W* 2l = W21Q21 and V21 = (A*A 2i - W 21 )/9. We also obtain 

A?/i = — ADi, 

V2 + ii«; + «iII 

and 

||Q 21 -I|| <4||A yi ||. 

Finally, we need to prove ||V*||oo < 1 and ||W|| 2 < 1. By noting that ||(<3 12 — I)Wi2||oo < IKQ12 — 
I)Wi 2 \\ 2 < \\Q 12 - IH2IIW12II2 < HQ12 - f || and ||Azi|| < 2/V3e, we have 

II ^12 II 00 — || V12 1| 00 < 11^12 ~~ V"i2||oo 

< e- 1 OAAHIAizlloo + ||(Q 12 - I)W 12 \U 
<0- 1 [|AA|||A 12 |U + ||Q 12 -/||] 
<0- 1 [|AA|||Ai 2 ||oo + 4||Azi||] 
<e- 1 [\\A 12 \\ 00 + 5}e. 



Thus 



12II00 



< \\V 12 \\ 00 + 9' 1 [\\A 12 \\ 00 + 5]e< 1. 
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Similarly, we also have 

11^21 Woo < ||V2i||oo + e _1 [ll^2i||oo + 5]e< 1. 



Now consider W* 2 . Clearly ||W*|| 2 < ||W|| 2 + \\W* - W\\ 2 . We have 
W* -W = (X* - A) 



[An 


^ 












A 22 J 





(Q 12 - I)W 12 \ _ lul{v* t ) T -u lV T 

2^-/) \ 0, 



Thus we have: 



|| W* - W\\ 2 < |AA| max{||An|| 2 , ||A 22 || 2 } + max{||(Q 12 - I)W 12 h, ||W 21 (Q 21 - I)|| 2 } 
+ \\u* l (v* l ) T - uivj\\ 2 

< max{||An|| 2 , ||A 22 || 2 }e + 5e + 2e + e 2 

< [||A|| 2 + 7.5]e, 

since max{|| An|| 2 , ||A 22 || 2 } < ||A|| 2 and < e < 1/2. Thus 

||W*||2 < ||W|| 2 + [||A|| 2 + 7.5]e< 1. 

We have constructed W* and V* that satisfy the optimality condition for X*, which implies that X* 
is indeed an optimal solution of Problem ([T|). □ 
Proposition Q] show that given an e-solution (X,Ui,Vi) of (|33p . which for example, can be obtained 
from the current solution (X^jY^) of the proximal point algorithm, if we could find W and V that 
satisfy the e-optimality conditions given in Proposition [H then we can stop the algorithm with an 
accurate rank-one solution for Problem ([T]). Given (A, u±, vi). Let us consider the following optimization 
problem: 

min ||W|| 2 

s.t. W11 = (AAn - 6E MxN ) - u x vj, 
Wj 2 ui = 0, 

W21V1 = 0, (34) 
\\W l2 - AA^IU < 9 - (HA12II00 + 5)e, 
\\W21 - AA 2 i||oo <0- (HA21II00 + 5)c, 

||W22-AA22||oo<^ 

Clearly, if we could find a feasible solution of Problem (|34p with the objective || W|| 2 < 1 — (|| A|| 2 + 
7.5)e, then the e-optimality conditions for (A, u\,vi) are satisfied. This problem is a non-smooth convex 
constrained optimization problem and our main purpose is to find a feasible solution with the objective 
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value that is small enough. Therefore, we can simply apply the projected subgradient method to solve 
it. The projected subgradient method uses the iteration 

W k+l = n w [W k - a k G k ] , 

where G k £ <9||"H^fe||2 is a subgradient of ||.||2 at W k and W is the feasible set of Problem (pl|) . The 
step size a k can be chosen as one of the standard step sizes of the general subgradient method. For 
this problem, we choose a k = O According to Zi§tak [9], we can always select G k = u k v-[ € 

^ll^fclb, where (u k ,v k ) is the singular vectors corresponding to the largest singular value of W k . We 
now consider the projection problem II>v(W r ): 

n>v(W") £ argmin ||W-W"|||i 

s.t. W n = (AA U - OEmxn) - «i«T, 
Wj 2Ul = 0, 

W 2 iv 1 = 0, (35) 
||Wi 2 - AA12II00 <0- (HAialloo + 5)e, 
||W 2 i - AA21IU <e- (HA21II00 + 5)e, 
HVF22-AA22II00 <0. 

The objective function \\W — W\\"^ is element-wise separable; therefore, Problem (|35|) is block- wise 
separable. For the (1, 1) subblock, we have the fixed solution W\\ = (AAn — OE^^n) — uivj. For 
the (2, 2) subblock, it is a simple element-wise separable optimization problem: 

min 11^22-^2211^ 
s.t. \\W22 -AA22IU < 0, 
whose optimal solution can be computed as follows: 

W22 = max {min { W22, AA 22 + 0} , AA 22 - 0} . 

For the (1,2) subblock, the corresponding optimization problem is column-wise separable: 

min ||Wi2 — W^Hf* 
s.t. Wj 2 ui = 0, 

HW12 - AA 12 ||oo <e- (||Ai2||oo + 5)e. 
Each subproblem is a quadratic knapsack problem which can be written as follows: 

min \\w — w\\ 2 
s.t. ujw = 0, 
s.t. I < w < u. 
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According to Brucker [2], there is an 0(n) algorithm for these quadratic knapsack problems. Thus we 
can find W12 efficiently. Similarly, the (2, 1) subblock can be found by solving a number of quadratic 
knapsack problems since the corresponding optimization problem for it is row-wise separable. 

6 Numerical Examples 

6.1 Sailboat Bitmap Image Example 

In this example, we use a 80-by-50 black-and-white bitmap image of a sailboat. There are 5 distinct 
components or non-overlapping features in this image: left sail (Feature 1), sail mast (Feature 2), right 
sail (Feature 3), hull (Feature 4), and rudder (Feature 5). The bitmap image of the sailboat is shown 
in Figure [TJ A matrix A is created with 30 columns, each of which represents a bitmap image of the 
sailboat with just 3 out of 5 features. The matrix A therefore has 5 rank-one submatrices composed of 
all ones since the bitmap image is black-and-white. The structure of matrix A and all the features (in 
terms of non-zero elements) are shown in Figure [2) We would like to use our proposed formulation to 
extract these rank-one submatrices. 




Figure 1: Bitmap image of the complete sailboat 

Our main task in this example is to use our proposed formulation to extract the features from the 
matrix A. We have developed two algorithms to solve Problem ([1]), the primal and dual. For these 
numerical examples, we will use the dual algorithm mainly due to its accuracy with respect to optimal 
solutions. This superior accuracy could be explained by the fact that the subproblem we solved in 
each iteration of the dual algorithm is similar to the original problem. The stopping criterion stated 
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30 incomplete images 



5 non-overlapping features 



15 20 25 30 



Figure 2: Collection of 30 incomplete images of the sailboat and its 5 components 

in Proposition [1] is implemented. We test the conditions of Kantorovich's theorem, solve the system 
of equations in (|33p to obtain an e-solution, and then use the projected subgradient method to find a 
feasible solution W of Problem (|34p . Since the projection problem with W = can be considered as 
a relaxation of Problem (|34p in which the spectral norm is replaced by the Frobenius norm, we will 
start the projected subgradient method with Wo = 0. The stopping criteria for this method are the 
condition ||W||2 < 1 — (||A||2 + 7.5)e, the maximum number of iterations, and the change in objective 
values. As the testing process is computationally quite expensive, therefore we only use it once per a 
fixed number (say, 10) of outer iterations. 

For each value of the parameter 9, we will obtain the optimal solution X\ and Xlj- The decision 
variable X<i corresponds to the /i-norm part of the objective function; therefore, we use the sparsity 
structure of X\ to construct the final solution X* F with the elements of X^- According to Theorem 5 in 
[3] , the optimal solution of Problem dU) indicates the exact sparsity structure of the rank-one submatrix 
(even under small random noise) with appropriate 9. Therefore, in this experiment, we extract the 
rank-one approximation of X* F and use its sparsity structure as the sparsity structure of the extracted 
feature. Next, we need to select an appropriate value for 9. For 9~0, the algorithm returns the rank- 
one approximation of matrix A, which is an average of all features and for the purpose of extracting 
single features, this averaging effect is not desirable. On the other hand, we prefer large submatrices 
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(large features) over small ones. Similar to the L-curve method used to select a regularization parameter, 
we construct the curve £ := {\\X(M e , Ng)\\ F , \\A(M e ,N e ) - X(M e ,N e )\\ F , 9 > 0}, where {Mg,N e ) is 
the sparsity structure obtained from the algorithm and X(Mg,Ng) is the rank-one approximation of 
A(Mg, Ng). We then pick 9 that balances the feature largeness, ||X(Mg, Ng)\\p, and feature averaging 
measure \\A(M e ,N e ) - X(M e ,Ng)\\ F . After selecting (9, we obtain X(M e ,N e ) = u(M e )v(N e ) T , where 
max(v(N0)) = 1. The vector u(Mg) represents the extracted feature and v(Nq) indicates how significant 
the feature is in each boat image. After extracting a feature, we remove the feature from the image by 
setting A(Mg,Ng) = and continue to find new (non-overlapping) features. This method for choosing 
9 is clearly just a heuristic and a more concrete approach for 9 selection is still an important issue for 
future research. 

We are now ready to run our algorithm on this sailboat example. We set the main tolerance to 
be e = 10~ 6 , the maximum number of iterations to be 1000, and for each subproblem, the maximum 
number of iterations is set to be 30. There is also the parameter A of the proximal point framework 
that we need to select. This parameter controls the convergence of the algorithm and for this example, 
A = 0(1/6) works well most of the time. We can always adjust A (and number of iterations) to get 
better convergence if the initial setting does not achieve the tolerance required. We set e s = 10~ 10 as 
the tolerance used in Newton's method to test the additional stopping criterion. Finally, the values of 
9 are selected uniformly from three different ranges, small range [0.01,0.1], medium range [0.1,1], and 
large range [1, 10], 10 values in each range. 

We start with the matrix A. Except for the first value of {9 = 0.01), all other values result in 
the same rank-one submatrix of A, which means || A(Mq, Ng) — X(Mg,Ng)\\ = 0. Thus we do not need 
to use the curve C and just need to pick any value of 9 > 0.01. The vector v(Ng) is a zero-one vector 
indicating that either the feature u(Mg) appears completely in an image or it does not appear at all. 
The feature u(Mg) represents the exact combination of Feature 1 and 4, which is the left sail and the 
hull. 

We now exclude the first extracted feature from all the images and continue to find new (non- 
overlapping) features. Table Q] shows all the features that we obtain with the size of the features (sj), 
number of images that share each feature (rii), and their description. 

The results show that our algorithm can pick out the large common features that are inherent in 
the structure of the image set. For example, a combination of our defined features is indeed a large 
common feature if there are enough images that share that combination of features. 

To end this section, we would like to comment on the efficiency of the additional stopping criterion 
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Figure 3: First extracted feature: the combination of left sail and hull 
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Description 


1 


710 


15 


Left sail and hull 


2 


694 


6 


Right sail and hull 


3 


252 


10 


Right sail 


4 


156 


5 


Sail mast and rudder 


5 


268 


7 


Left sail 


6 


119 


9 


Sail mast 


7 


439 


1 


Hull 


8 


34 


11 


Rudder 



Table 1: All extracted features obtained from the algorithm 
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based on Proposition [TJ When the test indicates the convergence is achieved, it is guaranteed that 
the supports of u and v have been correctly identified for the rank-one optimal solution X. On 
the other hand, because the test uses heuristics to find multipliers, it may occur that the optimal 
supports are attained and yet the test fails to indicate that. In this sailboat example, we ran the 
algorithm with 8 different matrices, Aq = A, A±, . . . , Aj with subsequent extraction of features one 
by one. The additional stopping criterion works for 4 out of 8 matrices and we obtain a significant 
reduction in both computational time and number of iterations while maintaining highly accurate 
solutions (e s = 10~ 10 ). Table [2] shows these improvements with 6 = 0.2, where (DDPA)/( ADDPA) is 
the algorithm without/with the additional stopping criteria. The number of iterations with (ADDPA) 
is either 10 or 20 since in this example, we only test the additional stopping criterion once per 10 (outer) 
iterations. We can see that there are cases when the additional stopping criterion can be used very 
early to stop the algorithm with a guaranteed highly accurate solution. 



Matrix 


(DPPA) 


(ADPPA) 


A 4 


(59, 325, 9.36s, 0.00s) 


(20, 123, 6.36s, 2.32s) 


A 5 


(52, 267, 7.28s, 0.00s) 


(10, 60, 4.23s, 2.39s) 


A 6 


(62,499, 12.4s, 0.00s) 


(20, 187,6.96s, 1.84s) 


A 7 


(29, 148, 3.59s, 0.00s) 


(10, 57, 4.00s, 2.48s) 



Table 2: Outer iteration number, inner iteration number, total computational time, and convergence 
testing time for (DDPA)/ (ADDPA) 

6.2 Image Database Test Case 

We conduct the experiment on the Prey face dataset, which consists of 1965 registered face images of 
size 28 x 20. The matrix A has the size of 1965 x 560, where each column represents a single face image. 
We again use the dual algorithm with the additional stopping criterion and maintain all parameters 
the same as in the previous example. The additional stopping criterion is less effective in this test case. 
However, when it works, we again have a significant improvement in computational time and number 
of iterations. For example, with Aq = A and 9 = 0.2, we have the following results for (DDPA) and 
(ADDPA) respectively: (245, 6466, 1.21 x 10 3 s, 0.00s) and (100, 2140, 4.25 x 10 2 s, 23.7s), where the tuple 
is explained in the caption of Figure [2j 

We apply the algorithm to the matrix A and Figure H] shows the curve C obtained with different 
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Figure 4: Feature largeness vs. feature averaging measure for different 9 

values of 6. We select = 0.2 at the curviest point on which indicates the balance between feature 
largeness and feature averaging measure. We obtain the feature Ui and the significance vector v\ 
indicating how strong the appearance of that feature is in each image. The feature is composed of 38 
pixels and there are 1557 images that are considered to have this feature with the significance factor 
of at least 95.14%, where the significance factor of the feature in image j is defined as vi(j)/\\v\ ||oo. 
Figure [5] presents the first feature and the face image that has the significance factor of 100% for this 
feature. Basically, the first feature shows the right forehead, a part of right cheekbone, and the tip of 
the nose. This feature is common among the images (1557 of them), there are images that do not share 
the feature. Figure [6] shows one of such images. 

We remove the first feature from all images that share that feature and continue to find new features. 
Table [3] shows the size of the feature i (sj), number of images that share the feature i (rii), and the 
minimum significance factor for each feature i (/™ in ), i = 1, ... ,10. 

Figure El an d E] presents each feature and the ten face images that have the highest significance 
factors for that feature. 

The features are not easy to observe or distinguish. However, with images that have high significance 
factors, we can see that some features could be associated with a certain orientation of the face or lighting 
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Figure 5: First feature and the image that has the highest significance factor 




Figure 6: An image without the first feature 
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38 


1557 


95.14% 


6 


28 


673 


83.27% 


2 


27 


896 


92.19% 


7 


21 


578 


80.38% 


3 


29 


1096 


87.61% 


8 


20 


555 


87.59% 


4 


24 


847 


83.12% 


9 


35 


291 


80.73% 


5 


25 


791 


83.67% 


10 


13 


598 


71.31% 



Table 3: Information of the first ten extracted features 
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Figure 7: First five features and images with highest significance factors 
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Figure 8: Feature 6 to Feature 10 and images with highest significant factors 

of images. For example, Feature 4 and Feature 9 clearly show the right (or left) cheek when Frey faces 
left (or right). Certain lighting of the background can also define features, which is the case of Feature 
3 and Feature 7. Another observation is that since this approach favors large submatrices, which means 
other features defined by small entries (dark pixels), for examples, eyes or mouth, will not be picked up 
as major features. In this particular application of visual features, we can define negative features, which 
correspond to the features of the negative images. In order to find these negative features, we construct 
the negative images and apply the algorithm to this set of images. In this example, the algorithm is 
applied to B = 255E — A, where E is the matrix of all ones. The coefficient 255 appears due to the 
range of pixel intensities in these images. For each feature u extracted from B, we define u n = 255e — u, 
where e is the vector of all ones, as the negative feature of the original set of images. The first three 
extracted features are presented in Figure The first negative feature has both straight dark eyes with 
two dark background columns at both sides on top. The second one focuses on the darker right eye, 
the left nostril, and also the chin. And the third one is a long dark background column on the top left. 
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Figure 9: First three negative features and images with highest significance factors 
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